perf(v2 pipeline): parallel agents + LLM iteration cap + tighter arti…#39
Merged
…cle prompt
Profile of one hybrid extraction on a 50-person paper-heavy repo
(deeplabcut/deeplabcut): person stage 5:06, **article stage 12:00 on a
single LLM agent invocation that fired 100+ tool calls**, membership
stage 6:37 — total ~25 min, dominated by serial waits and an
unbounded article-agent tool-call loop.
Three independent quick wins, each tunable per deployment:
1. `max_concurrent_agents` default 3→8 in the orchestrator (and the
`V2_MAX_CONCURRENT_AGENTS` env-resolver default in the API layer
raised 6→8). Person/membership stages are bottlenecked on the
`asyncio.Semaphore`, not the LLM provider — bumping the cap absorbs
wider-fanout repos without saturating RCP.
2. New `_default_usage_limits()` in `V2LLMRuntime`: caps every
agent invocation at 25 model requests + 50 tool calls via
pydantic-ai's `UsageLimits`. Without a cap the article agent could
keep cross-validating the same DOI across five tools indefinitely.
Overridable per-call (existing kw-only signature) or globally via
`V2_LLM_REQUEST_LIMIT` / `V2_LLM_TOOL_CALLS_LIMIT`. The cap turns
a runaway loop into a clean `LLMRuntimeError`, which the per-stage
runner already handles as a per-item warning.
3. Tightened the article agent system prompt with two "stop early"
rules: emit immediately once a concrete DOI/title is found (no
cross-validation past two sources), and emit `{}` after two
consecutive empty searches instead of looping. The LLM was doing
~25 OpenAlex calls per agent invocation chasing the same paper.
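To illustrate item 1, here is a minimal sketch of semaphore-bounded fan-out with an env-resolved cap. It is not the orchestrator's actual code: `resolve_max_concurrent_agents` and `run_bounded` are hypothetical names, and only the `V2_MAX_CONCURRENT_AGENTS` variable and the default of 8 come from the description above.

```python
import asyncio
import os


def resolve_max_concurrent_agents(default: int = 8) -> int:
    """Read V2_MAX_CONCURRENT_AGENTS; fall back to the default on bad input."""
    raw = os.environ.get("V2_MAX_CONCURRENT_AGENTS", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


async def run_bounded(items, worker, max_concurrent: int):
    """Run worker(item) for every item, at most max_concurrent at a time."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(item):
        # Each task waits here until one of the max_concurrent slots frees up.
        async with sem:
            return await worker(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(item) for item in items))
```

With a cap of 3, a 50-person stage runs in roughly 17 sequential waves; with a cap of 8 it runs in about 7, which is why raising the cap helps when the semaphore rather than the provider is the bottleneck.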
Expected impact on the profiled repo: ~25 min → ~10 min wall time. No
schema, API, or routing changes; the caps are tunable and ship with
conservative defaults.
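The cap-and-degrade behaviour in item 2 can be sketched as follows. This is a stand-alone simulation under stated assumptions, not the `V2LLMRuntime` code: `AgentLimits` is a stand-in for pydantic-ai's `UsageLimits`, the local `LLMRuntimeError` class stands in for the runtime's error type, and `run_agent` / `run_stage` are hypothetical helpers. Only the env variable names and the 25/50 defaults are taken from the description.

```python
import os
from dataclasses import dataclass


class LLMRuntimeError(RuntimeError):
    """Stand-in for the runtime's error type (assumed, not the real class)."""


@dataclass(frozen=True)
class AgentLimits:
    """Stand-in for pydantic-ai's UsageLimits, holding the two caps."""
    request_limit: int = 25
    tool_calls_limit: int = 50


def _env_int(name: str, default: int) -> int:
    """Parse a positive integer from the environment, else use the default."""
    try:
        value = int(os.environ.get(name, ""))
    except ValueError:
        return default
    return value if value > 0 else default


def default_usage_limits() -> AgentLimits:
    """Global defaults, overridable via the two env variables."""
    return AgentLimits(
        request_limit=_env_int("V2_LLM_REQUEST_LIMIT", 25),
        tool_calls_limit=_env_int("V2_LLM_TOOL_CALLS_LIMIT", 50),
    )


def run_agent(wanted_tool_calls: int, limits: AgentLimits) -> int:
    """Simulate an agent that tries to make wanted_tool_calls tool calls."""
    used = 0
    for _ in range(wanted_tool_calls):
        if used >= limits.tool_calls_limit:
            raise LLMRuntimeError(
                f"tool call limit of {limits.tool_calls_limit} exceeded"
            )
        used += 1
    return used


def run_stage(workload: dict[str, int], limits: AgentLimits):
    """Per-stage runner: a cap breach becomes a warning for that item only."""
    results, warnings = {}, []
    for name, wanted in workload.items():
        try:
            results[name] = run_agent(wanted, limits)
        except LLMRuntimeError as exc:
            warnings.append(f"{name}: {exc}")
    return results, warnings
```

For example, `run_stage({"a": 10, "b": 120}, default_usage_limits())` completes item `a` and records a single warning for item `b`, so one runaway invocation no longer stalls the whole stage.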